Descriptive Statistics with R

Author

Martin Schweinberger

Introduction

This tutorial introduces descriptive statistics — the methods we use to summarise, organise, and communicate what a dataset contains. Before we can test hypotheses or draw inferences about populations, we need to understand and describe what our data actually look like. Descriptive statistics provide the tools to do exactly that.

Descriptive statistics help us understand the main features of a dataset: its central tendency (where the data cluster), its variability (how spread out the values are), and its distribution (the shape of the data). These summaries offer valuable insights for analysing and interpreting quantitative data across all domains, including linguistics and the humanities (Bickel and Lehmann 2012; Thompson 2009).

This tutorial is aimed at beginners and intermediate R users. The goal is not to provide a complete analysis but to showcase selected, practically useful methods for describing and summarising data, illustrated with real and realistic linguistic examples.


Why Descriptive Statistics?

To see why data summaries are useful, consider the following scenario. You are teaching two different classes in the same school, at the same grade level. Both classes take the same exam. After grading, someone asks: “Which class performed better?” You could say: “Class A had 5 Bs, 10 Cs, 12 Ds, and 2 Fs, while Class B had 2 As, 8 Bs, 10 Ds, and 4 Fs” — but this answer is unsatisfying and hard to compare.

Descriptive statistics allow you to summarise complex datasets in just a few numbers or words. This is the practical power of descriptive statistics: they compress data without losing its essential character.

What This Tutorial Covers

Statistics can be divided into two main branches:

  • Descriptive statistics: Describes and visualises data (this tutorial)
  • Inferential statistics: Tests hypotheses and draws conclusions about populations from samples

This tutorial covers:

  1. Measures of central tendency — mean, median, mode, geometric mean, harmonic mean
  2. Measures of variability — range, IQR, variance, SD, SE
  3. Confidence intervals — parametric, bootstrapped, nominal, multinomial
  4. Data distributions — histograms, density plots, Q-Q plots, skewness, kurtosis
  5. Comprehensive summaries — summary(), psych::describe(), publication-ready tables
  6. Frequency tables and cross-tabulations — table(), prop.table(), grouped bar charts
  7. Data visualisation — boxplots, violin plots, bar charts, scatterplots
  8. Correlation — Pearson’s r, Spearman’s ρ, correlation matrices
  9. Outlier detection — z-score method, IQR fences, visualisation
  10. Robust measures — trimmed mean, Winsorisation, MAD
  11. Working with real data — loading, inspecting, missing values, data quality
  12. Effect sizes — Cohen’s d, Pearson’s r, eta-squared (η²)
  13. Reporting standards — APA conventions, model paragraphs and tables

Preparation: Installing and Loading Packages

This tutorial requires several R packages. If you have not yet installed them, run the code below. This only needs to be done once.

Code
# install required packages  
install.packages("dplyr")  
install.packages("ggplot2")  
install.packages("stringr")  
install.packages("boot")  
install.packages("DescTools")  
install.packages("ggpubr")  
install.packages("flextable")  
install.packages("psych")  
install.packages("Rmisc")  
install.packages("checkdown")  
install.packages("e1071")     # used below for skewness and kurtosis  
install.packages("cowplot")   # used below to arrange plots side by side  

Once installed, load the packages as shown below. We also set two global options that suppress scientific notation and prevent automatic factor conversion.

Code
# set global options  
options(stringsAsFactors = FALSE)        # do not convert character strings to factors automatically  
options("scipen" = 100, "digits" = 12)   # suppress scientific notation and show more digits  
  
# load packages  
library(boot)  
library(DescTools)  
library(dplyr)  
library(stringr)  
library(ggplot2)  
library(flextable)  
library(ggpubr)  
library(psych)  
library(Rmisc)  
library(checkdown)  

Measures of Central Tendency

A measure of central tendency is a single value that represents the “typical” or “central” value in a distribution. In linguistics and the language sciences, three measures are particularly important: the mean, the median, and the mode (Gaddis and Gaddis 1990).

Two further measures — the geometric mean and the harmonic mean — are occasionally relevant for specific types of data and are briefly covered below.

Choosing the Right Measure

The appropriate measure of central tendency depends on:

  1. The type of variable (nominal, ordinal, interval, ratio)
  2. The distribution of the data (normal vs. skewed; presence of outliers)
  3. The purpose of the summary

The table below gives an overview.

Measure            When to Use
(Arithmetic) mean  Numeric variables with approximately normal distributions; most common measure of central tendency
Median             Numeric or ordinal variables with non-normal distributions, skew, or influential outliers
Mode               Nominal and categorical variables; also useful for any variable to identify the most frequent value
Geometric mean     Dynamic processes involving multiplicative change (e.g., growth rates, compound interest)
Harmonic mean      Dynamic processes involving rates and velocities (e.g., average speed across equal distances)


The (Arithmetic) Mean

The mean is the most widely used measure of central tendency. It is appropriate when data are numeric and approximately normally distributed.

The mean is calculated by summing all values and dividing by the number of values:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

Note: Two distributions can have the same mean (for example, 5) yet very different shapes. This illustrates that the mean alone does not fully describe a distribution — measures of variability are also essential.

A Worked Example: Sentence Length in Moby Dick

Suppose we want to calculate the mean sentence length (in words) in the opening paragraph of Herman Melville’s Moby Dick. We first tabulate the sentences and their lengths:

Sentences (word counts in parentheses):

  • Call me Ishmael. (3)
  • Some years ago — never mind how long precisely — having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. (40)
  • It is a way I have of driving off the spleen, and regulating the circulation. (15)
  • Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off — then, I account it high time to get to sea as soon as I can. (87)

To calculate the mean, we divide the total word count by the number of sentences:

\[\bar{x} = \frac{3 + 40 + 15 + 87}{4} = \frac{145}{4} = 36.25\]

The mean sentence length is 36.25 words. In R:

Code
# create vector of word counts per sentence  
sentence_lengths <- c(3, 40, 15, 87)  
# calculate mean  
mean(sentence_lengths)  
[1] 36.25
Limitation of the Mean: Sensitivity to Outliers

The mean is strongly pulled towards extreme values. In our example, the 87-word sentence dramatically inflates the mean compared to the shorter sentences. When data contain outliers or are skewed, the median is often a more representative measure of central tendency.


Exercises: The Mean

Q1. What is the arithmetic mean of: 1, 2, 3, 4, 5, 6?






Q2. Which R function calculates the arithmetic mean of a numeric vector?






Q3. A corpus contains five texts with the following word counts: 200, 250, 210, 180, and 1500. Why might the mean NOT be the best summary of typical text length here?






The Median

The median is the middle value when data are arranged in ascending or descending order. Exactly 50% of values lie below the median and 50% lie above it. The median is more robust than the mean because it is not affected by extreme values.

\[\text{median}(x) = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right) & \text{if } n \text{ is even} \end{cases}\]
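In R, median() handles both cases automatically. A quick check using the Moby Dick sentence lengths from above (even n) and a shortened odd-length version:

Code
median(c(3, 15, 40))       # odd n: the middle value -> 15
median(c(3, 40, 15, 87))   # even n: mean of the two middle values -> 27.5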

A Linguistic Example: Speaker Age in the ICE-Ireland Corpus

Consider the age distribution of speakers in the private dialogue section of the Irish component of the International Corpus of English (ICE-Ireland):

Age    Counts
0–18        9
19–25     160
26–33      70
34–41      15
42–49       9
50+        57

The age groups are ordinal: they have a natural order (young to old) but we cannot assume equal intervals between groups. The appropriate measure is the median, not the mean.

To find the median, we create a vector where each speaker’s age rank appears as many times as there are speakers in that group:

Code
# create a ranked vector (1 = youngest group, 6 = oldest)  
age_ranks <- c(rep(1, 9), rep(2, 160), rep(3, 70),  
               rep(4, 15), rep(5, 9),  rep(6, 57))  
# total speakers  
length(age_ranks)  # 320 speakers  
[1] 320
Code
# find the median rank  
median(age_ranks)  
[1] 2

The median rank is 2, corresponding to the 19–25 age group. This tells us the median speaker in ICE-Ireland private dialogues is between 19 and 25 years old.

Mean vs. Median: When Does It Matter?

The mean and median agree when distributions are symmetric. They diverge when distributions are skewed or contain outliers:

Situation Preferred measure
Normal distribution, no outliers Mean
Skewed distribution (e.g., text lengths, corpus frequencies) Median
Ordinal variables (e.g., Likert scales, age groups) Median
Influential extreme values Median

In corpus linguistics, frequency data is almost always right-skewed: a few very frequent items dominate. The median is therefore often more informative than the mean for frequency distributions.
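As a quick illustration, here is a small made-up frequency vector in which a handful of very frequent items pull the mean far above the median:

Code
# hypothetical token frequencies: many rare items, a few very frequent ones
freqs <- c(rep(1, 60), rep(2, 25), rep(5, 10), 40, 350, 4100)
mean(freqs)    # ~ 47.4, inflated by the three most frequent items
median(freqs)  # 1, the typical (rare) item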


Exercises: The Median

Q1. What is the median of: 1, 2, 3, 4, 5, 6?






Q2. A researcher has reaction time data with several extremely slow responses (>2000ms). Which measure of central tendency should she report as the primary summary?






Q3. When is the median equal to the mean?






The Mode

The mode is the most frequently occurring value (or category) in a distribution. It is the only measure of central tendency that can be used with nominal (categorical) variables.

Example: Speaker Residence in ICE-Ireland

Current residence  Speakers
Belfast                  98
Down                     20
Dublin (city)           110
Limerick                 13
Tipperary                19

The mode is Dublin (city), with 110 speakers — the single largest group. In R:

Code
# create a factor with current residences  
residence <- c(  
  rep("Belfast", 98),  
  rep("Down", 20),  
  rep("Dublin (city)", 110),  
  rep("Limerick", 13),  
  rep("Tipperary", 19)  
)  
# calculate mode: find which level occurs most frequently  
names(which.max(table(residence)))  
[1] "Dublin (city)"
R Has No Built-in mode() Function for Statistics

R’s built-in mode() function returns the storage mode of an object (e.g., “numeric”, “character”) — not the statistical mode. To find the most frequent value, use names(which.max(table(x))) as shown above.

Caution: which.max() returns only the first maximum. If two categories have equal frequency (a bimodal distribution), only one is returned. Always inspect the full frequency table (table(x)) to detect ties.
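A small sketch of how such ties can be detected — instead of which.max(), keep every level that reaches the maximum count:

Code
# toy example with a tie between "a" and "b"
x   <- c("a", "a", "b", "b", "c")
tab <- table(x)
names(tab)[tab == max(tab)]   # returns both modes: "a" and "b"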


Exercises: The Mode

Q1. For which type of variable is the mode the ONLY appropriate measure of central tendency?






Q2. A corpus linguist finds that ‘the’ appears 4000 times, ‘a’ appears 2500 times, and ‘of’ appears 1800 times in a sample. What is the mode of word frequency in this sample?






When Measures of Centrality Diverge: A Cautionary Example

The following example illustrates why the choice of measure matters. Consider discourse particle use (per 1,000 words) across five speakers in two different corpora:

Corpus  Speaker  Frequency
C1      A             11.4
C1      B              5.2
C1      C             27.1
C1      D              9.6
C1      E             13.7
C2      A              0.2
C2      B              0.0
C2      C              1.1
C2      D             65.3
C2      E              0.4

Both corpora have a mean of 13.4 particles per 1,000 words — identical by the mean. But the distributions are entirely different:

  • Corpus 1: Values are fairly evenly distributed across speakers (range: 5.2–27.1)
  • Corpus 2: Four speakers use almost no particles (0.0–1.1), but Speaker D uses them at an extremely high rate (65.3), inflating the mean

The median reveals this difference clearly:
- Corpus 1 median: 11.4 (reflects the typical speaker)
- Corpus 2 median: 0.4 (reflects the typical speaker — very different from the mean of 13.4!)
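In R, using the values from the table above:

Code
c1 <- c(11.4, 5.2, 27.1, 9.6, 13.7)
c2 <- c(0.2, 0.0, 1.1, 65.3, 0.4)
mean(c1); mean(c2)      # both 13.4
median(c1); median(c2)  # 11.4 vs. 0.4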

Key Lesson

Always report the median alongside the mean, especially when:

  • The distribution is known or suspected to be non-normal
  • You are working with corpus frequency data (almost always right-skewed)
  • Outliers may be present
  • The variable is ordinal rather than truly numeric

The mean and median together tell a richer story than either alone.


Geometric Mean

The geometric mean is used for dynamic processes where later values depend on earlier ones — particularly for growth rates and multiplicative processes.

\[\bar{x}_{\text{geom}} = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}\]

Example: Comparing two investment packages

Year    Package 1  Package 2
Year 1       +5%       +20%
Year 2       −5%       −20%
Year 3       +5%       +20%
Year 4       −5%       −20%

The arithmetic means of both packages are 0% — they appear identical. But the geometric means reveal the truth:

  • Package 1: \(1.05 \times 0.95 \times 1.05 \times 0.95 = 0.995\) → 0.5% total loss → ~0.125% loss/year
  • Package 2: \(1.20 \times 0.80 \times 1.20 \times 0.80 = 0.922\) → 7.8% total loss → ~2% loss/year

Package 2 performs much worse because percentage gains and losses are not symmetric: a 20% loss requires more than a 20% gain to recover.
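In R, the geometric mean can be computed as the n-th root of the product or, equivalently, via exp(mean(log(x))). A brief sketch using the yearly multipliers from the table:

Code
package1 <- c(1.05, 0.95, 1.05, 0.95)
package2 <- c(1.20, 0.80, 1.20, 0.80)
exp(mean(log(package1)))   # ~ 0.9987 -> about 0.125% loss per year
exp(mean(log(package2)))   # ~ 0.9798 -> about 2% loss per year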


Harmonic Mean

The harmonic mean gives the average rate when each component contributes equally to a fixed total amount of work or distance.

\[\bar{x}_{\text{harm}} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}\]

Example: You drive 60 km to work at 30 kph and return at 60 kph. The distance is equal in both directions. What is your average speed?

\[\bar{x}_{\text{harm}} = \frac{2}{\frac{1}{30} + \frac{1}{60}} = \frac{2}{\frac{3}{60}} = 40 \text{ kph}\]

At 40 kph both ways, the trip takes the same total time (3 hours) as the original unequal speeds. The arithmetic mean (45 kph) would give the wrong answer: at 45 kph both ways, the trip would only take 2h40min.
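In R, the harmonic mean is simply n divided by the sum of reciprocals:

Code
speeds <- c(30, 60)
length(speeds) / sum(1 / speeds)   # 40 kph (harmonic mean)
mean(speeds)                       # 45 kph (arithmetic mean, misleading here)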


Measures of Variability

Measures of variability describe how spread out or dispersed values are around the central tendency. Two datasets can have identical means but very different distributions — as the Moscow/Hamburg temperature example below illustrates.

Why Variability Matters

Without variability measures, central tendency statistics alone are misleading. Two cities can have the same annual mean temperature but vastly different climates. Two speakers can have the same mean utterance length but very different speaking styles. Always report a measure of variability alongside any measure of central tendency.

Month      Moscow  Hamburg
January     -5.00     7.00
February   -12.00     7.00
March        5.00     8.00
April       12.00     9.00
May         15.00    10.00
June        18.00    13.00
July        22.00    15.00
August      23.00    15.00
September   20.00    13.00
October     16.00    11.00
November     8.00     8.00
December     1.00     7.00
Mean        10.25    10.25


Range

The range is the simplest measure of variability: it reports the minimum and maximum values of a distribution.

Code
Moscow <- c(-5, -12, 5, 12, 15, 18, 22, 23, 20, 16, 8, 1)  
Hamburg <- c(7, 7, 8, 9, 10, 13, 15, 15, 13, 11, 8, 7)  
  
# range for Moscow  
min(Moscow); max(Moscow)  
[1] -12
[1] 23
Code
# range for Hamburg  
min(Hamburg); max(Hamburg)  
[1] 7
[1] 15

Moscow ranges from −12°C to 23°C (a span of 35 degrees), while Hamburg ranges from 7°C to 15°C (a span of only 8 degrees). The range confirms the stark climatic difference.
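Base R also provides range(), which returns both extremes in one call; diff() then gives the span:

Code
range(Moscow);  diff(range(Moscow))    # -12 23; span 35
range(Hamburg); diff(range(Hamburg))   # 7 15;  span 8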

Limitation of the range: The range only uses the two most extreme values and is therefore highly sensitive to outliers. A single extreme observation dramatically changes the range without representing the bulk of the data.


Interquartile Range (IQR)

The interquartile range (IQR) is the range that encompasses the central 50% of data points. It is calculated as the difference between the third quartile (Q3, 75th percentile) and the first quartile (Q1, 25th percentile).

\[\text{IQR} = Q_3 - Q_1\]

Because the IQR excludes the top and bottom 25% of the data, it is robust to outliers and more informative than the range for skewed distributions.

Code
# full summary including quartiles  
summary(Moscow)  
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -12.00    4.00   13.50   10.25   18.50   23.00 
Code
# calculate IQR directly  
IQR(Moscow)  
[1] 14.5
Code
IQR(Hamburg)  
[1] 5.25

For Moscow: Q1 = 4°C, Q3 = 18.5°C → IQR = 14.5°C
For Hamburg: Q1 = 7.75°C, Q3 = 13°C → IQR = 5.25°C

The IQR confirms what the range suggested: Moscow has far greater seasonal temperature variability than Hamburg.

Visualising Variability: The Boxplot

The boxplot is the standard visualisation for the IQR. The box spans Q1 to Q3 (the IQR), the middle line marks the median, and the whiskers extend to the most extreme non-outlier values.
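A minimal sketch of such a boxplot for the Moscow and Hamburg temperature vectors defined above (using ggplot2, loaded earlier):

Code
data.frame(
  City        = rep(c("Moscow", "Hamburg"), each = 12),
  Temperature = c(Moscow, Hamburg)
) |>
  ggplot(aes(x = City, y = Temperature, fill = City)) +
  geom_boxplot(alpha = 0.7) +
  theme_bw() +
  labs(title = "Monthly temperatures: Moscow vs. Hamburg",
       y = "Temperature (°C)", x = "") +
  theme(legend.position = "none")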


Variance

The variance quantifies how much individual values deviate from the mean, on average. It is calculated as the average of the squared deviations from the mean:

\[s^2 = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2\]

The reason we square the deviations is to avoid positive and negative deviations cancelling each other out. We divide by \(n-1\) (rather than \(n\)) to obtain an unbiased estimate of the population variance.
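To confirm that var() implements exactly this formula, a quick manual check:

Code
# sum of squared deviations from the mean, divided by n - 1
sum((Moscow - mean(Moscow))^2) / (length(Moscow) - 1)   # matches var(Moscow) below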

Code
# variance of Moscow temperatures  
var(Moscow)  
[1] 123.659090909
Code
# variance of Hamburg temperatures  
var(Hamburg)  
[1] 9.47727272727

The variance of Moscow temperatures (≈ 123.7) is roughly 13× larger than Hamburg’s (≈ 9.5), reflecting the much greater seasonal fluctuation.

Limitation: The variance is expressed in squared units (°C²), which makes it difficult to interpret directly. The standard deviation addresses this by taking the square root.


Standard Deviation

The standard deviation (SD) is the square root of the variance. It is expressed in the same units as the original data and is by far the most commonly reported measure of variability.

\[\sigma = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2}\]

Code
# standard deviation of Moscow temperatures  
sd(Moscow)  
[1] 11.1202109202
Code
# standard deviation of Hamburg temperatures  
sd(Hamburg)  
[1] 3.07851794331

Moscow: SD ≈ 11.12°C — on average, monthly temperatures deviate 11 degrees from the annual mean.
Hamburg: SD ≈ 3.08°C — temperatures stay within about 3 degrees of the mean throughout the year.

Interpreting the Standard Deviation

For a normally distributed variable, approximately:
- 68% of values fall within ±1 SD of the mean
- 95% of values fall within ±2 SD of the mean
- 99.7% of values fall within ±3 SD of the mean

This is the empirical rule (or 68-95-99.7 rule). It provides a practical benchmark for assessing whether a particular value is typical or unusual.
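These percentages can be verified directly from the standard normal distribution with pnorm():

Code
pnorm(1) - pnorm(-1)   # ~ 0.683 -> 68% within +/- 1 SD
pnorm(2) - pnorm(-2)   # ~ 0.954 -> 95% within +/- 2 SD
pnorm(3) - pnorm(-3)   # ~ 0.997 -> 99.7% within +/- 3 SD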


Exercises: Measures of Variability

Q1. What does the IQR measure?






Q2. Why is the standard deviation reported more often than the variance?






Q3. Researcher A reports a mean corpus frequency of 50 per million words (SD = 2). Researcher B reports a mean of 50 per million words (SD = 40). What does the difference in SD tell us?






Q4. For the following two vectors, calculate mean, median, and standard deviation in R. Which vector has higher variability?

A: 1, 3, 6, 2, 1, 1, 6, 8, 4, 2, 3, 5, 0, 0, 2, 1, 2, 1, 0
B: 3, 2, 5, 1, 1, 4, 0, 0, 2, 3, 0, 3, 0, 5, 4, 5, 3, 3, 4






Standard Error

The standard error of the mean (SEM) is a measure of how precisely the sample mean estimates the true population mean. It depends on both the standard deviation and the sample size:

\[\text{SE}_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]

A larger sample size reduces the standard error — larger samples produce more precise estimates of the population mean. This is why replication and large samples matter in empirical research.
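A brief sketch of this relationship, using a hypothetical SD of 66 ms (roughly the SD of the reaction-time data introduced below):

Code
s <- 66                          # hypothetical standard deviation (ms)
s / sqrt(c(10, 20, 100, 1000))   # SE shrinks as the sample size grows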

To illustrate, we use a dataset of reaction times from a lexical decision task (participants decide whether a letter string is a real word):

RT       State  Gender
429.276  Sober  Male
435.473  Sober  Male
394.535  Sober  Male
377.325  Sober  Male
430.294  Sober  Male
289.102  Sober  Female
411.505  Sober  Female
366.191  Sober  Female
365.792  Sober  Female
334.034  Sober  Female
444.188  Drunk  Male
540.866  Drunk  Male
468.531  Drunk  Male
476.011  Drunk  Male
412.473  Drunk  Male
520.845  Drunk  Female
435.682  Drunk  Female
463.421  Drunk  Female
536.036  Drunk  Female
494.936  Drunk  Female
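The rts data frame used in the code below is assumed to contain exactly the values tabulated above; a minimal way to construct it for this tutorial is:

Code
rts <- data.frame(
  RT = c(429.276, 435.473, 394.535, 377.325, 430.294,
         289.102, 411.505, 366.191, 365.792, 334.034,
         444.188, 540.866, 468.531, 476.011, 412.473,
         520.845, 435.682, 463.421, 536.036, 494.936),
  State  = rep(c("Sober", "Drunk"), each = 10),
  Gender = rep(rep(c("Male", "Female"), each = 5), times = 2)
)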

We can calculate the SEM manually:

Code
# manual calculation of standard error  
sd(rts$RT, na.rm = TRUE) / sqrt(length(rts$RT[!is.na(rts$RT)]))  
[1] 14.7692485022

Or more conveniently, using the describe() function from the psych package:

Code
# describe() from psych provides SE alongside other summary statistics  
psych::describe(rts$RT, type = 2)  
   vars  n   mean    sd median trimmed  mad   min    max  range skew kurtosis
X1    1 20 431.33 66.05 432.88   432.9 60.4 289.1 540.87 251.76 -0.2    -0.13
      se
X1 14.77
Standard Error vs. Standard Deviation

These two measures are frequently confused:

Measure What it describes
Standard deviation (SD) Variability of individual data points around the mean
Standard error (SE) Precision of the sample mean as an estimate of the population mean

The SD tells us about the spread of the data. The SE tells us about the precision of our estimate. SE = SD / √n, so larger samples always have smaller SE.


Confidence Intervals

A confidence interval (CI) is a range of values that is likely to contain the true population parameter (typically the mean) with a specified level of confidence (typically 95%).

More precisely: if you repeated your study many times and constructed a 95% CI each time, 95% of those intervals would contain the true population mean. It does not mean “there is a 95% probability the true mean is in this particular interval” (a common misconception).

Confidence intervals are important because they quantify the uncertainty in our estimates and communicate how precisely we can estimate a population value from a sample.

The confidence interval for the mean is calculated as:

\[\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\]

For a 95% CI with normally distributed data, the z-value is 1.96 (the value that cuts off 2.5% in each tail of the standard normal distribution):

Code
# verify the z-value for 95% CI  
qnorm(0.975)   # 2.5% in the upper tail → z = 1.96  
[1] 1.95996398454

A practical example — calculating the 95% CI for a vector of values:

Code
values <- c(4, 5, 2, 3, 1, 4, 3, 6, 3, 2, 4, 1)  
  
m <- mean(values)  
s <- sd(values)  
n <- length(values)  
  
lower <- m - 1.96 * (s / sqrt(n))  
upper <- m + 1.96 * (s / sqrt(n))  
  
cat("Lower CI:", round(lower, 3), "\n")  
Lower CI: 2.302 
Code
cat("Mean:    ", round(m, 3), "\n")  
Mean:     3.167 
Code
cat("Upper CI:", round(upper, 3), "\n")  
Upper CI: 4.031 

Exercises: Confidence Intervals

Q1. What does a 95% confidence interval tell us?






Q2. A researcher finds a 95% CI of [48.2, 51.8] for a mean corpus frequency. What can she conclude?






Q3. Two studies estimate the same mean (50.0), but Study A has CI [45, 55] and Study B has CI [49, 51]. What explains the difference?






Confidence Intervals in R: Multiple Methods

For a Simple Vector

The CI() function from the Rmisc package is the most convenient approach:

Code
Rmisc::CI(rts$RT, ci = 0.95)  
        upper          mean         lower 
462.238192381 431.325800000 400.413407619 

Alternatively, use t.test() (which also tests H₀: μ = 0):

Code
stats::t.test(rts$RT, conf.level = 0.95)  

    One Sample t-test

data:  rts$RT
t = 29.20431598, df = 19, p-value < 0.0000000000000002220446
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 400.413407619 462.238192381
sample estimates:
mean of x 
 431.3258 

Or MeanCI() from DescTools:

Code
DescTools::MeanCI(rts$RT, conf.level = 0.95)  
         mean        lwr.ci        upr.ci 
431.325800000 400.413407619 462.238192381 

Bootstrapped Confidence Intervals

Bootstrapping is a powerful, assumption-free alternative. Rather than relying on the formula \(\bar{x} \pm 1.96 \times \text{SE}\) (which assumes normality), bootstrapping repeatedly resamples the data and builds an empirical distribution of the sample mean.

Code
# bootstrapped CI using MeanCI  
DescTools::MeanCI(rts$RT, method = "boot", type = "norm", R = 1000)  
         mean        lwr.ci        upr.ci 
431.325800000 402.092500763 459.569724437 
When to Use Bootstrapped CIs

Bootstrapped confidence intervals are preferable when:

  • The data are not normally distributed and the sample is small
  • You cannot assume the parametric formula is appropriate
  • You want an empirical, data-driven estimate of uncertainty

For large samples, parametric and bootstrapped CIs typically converge. Because bootstrapping is random (resampling), the results will vary slightly between runs. Use a fixed set.seed() for reproducibility.

Code
# manual bootstrapping with the boot package  
set.seed(42)  
  
BootFunction <- function(x, index) {  
  return(c(  
    mean(x[index]),  
    var(x[index]) / length(index)  
  ))  
}  
  
Bootstrapped <- boot(  
  data      = rts$RT,  
  statistic = BootFunction,  
  R         = 1000  
)  
  
# extract bootstrapped mean and 95% CI  
mean(Bootstrapped$t[, 1])  
[1] 430.72580535
Code
boot.ci(Bootstrapped, conf = 0.95)  
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = Bootstrapped, conf = 0.95)

Intervals : 
Level      Normal              Basic             Studentized     
95%   (403.2, 460.7 )   (403.4, 461.1 )   (399.0, 462.5 )  

Level     Percentile            BCa          
95%   (401.6, 459.2 )   (402.4, 459.7 )  
Calculations and Intervals on Original Scale

Confidence Intervals for Grouped Data

Use summarySE() from Rmisc to extract mean and CI by group:

Code
Rmisc::summarySE(  
  data          = rts,  
  measurevar    = "RT",  
  groupvars     = "Gender",  
  conf.interval = 0.95  
)  
  Gender  N       RT            sd            se            ci
1 Female 10 421.7544 82.8522922089 26.2001952746 59.2689594071
2   Male 10 440.8972 46.2804393809 14.6351599557 33.1070319225

Confidence Intervals for Nominal (Proportional) Data

When the data are nominal (e.g., did a speaker use a particular construction: yes/no?), we use binomial confidence intervals. Here we test whether 2 successes out of 20 trials differ from a chance level of 0.5:

Code
stats::binom.test(  
  x           = 2,    # observed successes  
  n           = 20,   # total trials  
  p           = 0.5,  # hypothesised probability  
  alternative = "two.sided",  
  conf.level  = 0.95  
)  

    Exact binomial test

data:  2 and 20
number of successes = 2, number of trials = 20, p-value =
0.000402450562
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.0123485271703 0.3169827140191
sample estimates:
probability of success 
                   0.1 

The BinomCI() function from DescTools offers more method options:

Code
DescTools::BinomCI(  
  x          = 2,  
  n          = 20,  
  conf.level = 0.95,  
  method     = "modified wilson"  
)  
     est          lwr.ci         upr.ci
[1,] 0.1 0.0177680755349 0.301033645228

Confidence Intervals for Multinomial Data

When there are more than two nominal categories (e.g., which of several linguistic constructions was chosen), use MultinomCI():

Code
# observed counts for four categories  
observed <- c(35, 74, 22, 69)  
  
DescTools::MultinomCI(  
  x          = observed,  
  conf.level = 0.95,  
  method     = "goodman"  
)  
       est          lwr.ci         upr.ci
[1,] 0.175 0.1180188110631 0.251643112255
[2,] 0.370 0.2898697185019 0.457995050825
[3,] 0.110 0.0661144727952 0.177479835187
[4,] 0.345 0.2668784298282 0.432498795139

Exercises: Calculating CIs in R

Q1. Calculate the mean and 95% CI for: 1, 2, 3, 4, 5, 6, 7, 8, 9. Which of the following is the correct lower bound of the 95% CI?






Q2. Which method for computing confidence intervals does NOT assume that the data follow a normal distribution?






Data Distributions

One of the most important — and most often neglected — steps in any quantitative analysis is examining the shape of your data distribution. Descriptive statistics like the mean and standard deviation assume (or are sensitive to) specific distributional shapes. Understanding whether your data are symmetric, skewed, or heavy-tailed is essential for choosing the right analysis and reporting your data responsibly.

What Is a Distribution?

A distribution describes how values in a dataset are spread across the possible range of values. Key properties include:

  • Shape: Is it symmetric, left-skewed, or right-skewed?
  • Modality: Is there one peak (unimodal) or more (bimodal, multimodal)?
  • Tails: Are extreme values common (heavy tails) or rare (thin tails)?
  • Centre and spread: Where do values cluster, and how widely do they range?

Histograms

The histogram is the most fundamental tool for visualising a distribution. It divides the range of values into equal-width bins and counts how many observations fall into each bin.

In R, a histogram is produced with hist() or (recommended) ggplot2:

Code
# simulate a right-skewed corpus frequency distribution  
set.seed(42)  
corpus_freq <- rgamma(200, shape = 2, rate = 0.1)  
  
# base R  
hist(corpus_freq,  
     main = "Corpus word frequencies (simulated)",  
     xlab = "Frequency per million",  
     col  = "steelblue", border = "white")  

Code
# ggplot2 (more flexible)  
ggplot(data.frame(x = corpus_freq), aes(x = x)) +  
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +  
  theme_bw() +  
  labs(title = "Corpus word frequencies (simulated)",  
       x = "Frequency per million", y = "Count")  

Choosing the Number of Bins

The number of bins (bins in ggplot2, breaks in base R) dramatically affects the appearance of a histogram. Too few bins: the shape is obscured. Too many bins: noise dominates the signal.

  • Start with bins = 20–30 and adjust visually
  • Rule of thumb: Sturges’ rule suggests \(k = \lceil \log_2 n \rceil + 1\) bins
  • The breaks = "FD" option in hist() uses the Freedman-Diaconis rule, which is robust to outliers
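A brief check of these rules in R, using the simulated corpus_freq vector from above (n = 200):

Code
nclass.Sturges(corpus_freq)   # Sturges' rule: 9 bins for n = 200
hist(corpus_freq, breaks = "FD",
     main = "Freedman-Diaconis binning",
     xlab = "Frequency per million", col = "steelblue", border = "white")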

Density Plots

A density plot is a smoothed version of the histogram. It estimates the underlying probability density function of the data using kernel density estimation (KDE). Density plots are particularly useful for comparing multiple distributions on the same plot.

Code
set.seed(42)  
# simulate reaction times for two conditions  
rt_data <- data.frame(  
  RT        = c(rnorm(200, 450, 80), rnorm(200, 520, 100)),  
  Condition = rep(c("Congruent", "Incongruent"), each = 200)  
)  
  
rt_data |>  
  ggplot(aes(x = RT, fill = Condition, color = Condition)) +  
  geom_density(alpha = 0.4, linewidth = 0.8) +  
  theme_bw() +  
  scale_fill_manual(values  = c("steelblue", "tomato")) +  
  scale_color_manual(values = c("steelblue", "tomato")) +  
  labs(title = "Reaction time distributions: Congruent vs. Incongruent (simulated)",  
       x = "Reaction Time (ms)", y = "Density") +  
  theme(legend.position = "top", panel.grid.minor = element_blank())  

The density plot clearly shows that the Incongruent condition produces slower (rightward-shifted) reaction times with greater variability — a classic interference effect.


Q-Q Plots: Testing for Normality Visually

A Q-Q plot (quantile-quantile plot) compares the quantiles of your data against the quantiles expected from a theoretical normal distribution. If the data are normally distributed, the points fall along a straight diagonal line.

Code
set.seed(42)  
# normal data: points should follow the diagonal  
normal_data <- rnorm(200, mean = 0, sd = 1)  
  
# skewed data: points will curve away from the diagonal  
skewed_data <- rgamma(200, shape = 1.5, rate = 1)  
  
par(mfrow = c(1, 2))  
qqnorm(normal_data, main = "Q-Q plot: Normal data",  pch = 16, col = "steelblue")  
qqline(normal_data, col = "red", lwd = 2)  
  
qqnorm(skewed_data, main = "Q-Q plot: Skewed data",  pch = 16, col = "tomato")  
qqline(skewed_data, col = "red", lwd = 2)  

Code
par(mfrow = c(1, 1))  

How to read a Q-Q plot

  • Points on the line: data are consistent with normality
  • Points curving upward at the right tail: right skew (long upper tail)
  • Points curving downward at the left tail: left skew
  • Points deviating at both tails in S-shape: heavy tails (leptokurtosis)
Q-Q Plot vs. Shapiro-Wilk: Which to Use?

The Shapiro-Wilk test formally tests H₀: “the data are normally distributed.” However, for large samples (n > ~100), Shapiro-Wilk detects trivially small deviations as “significant” — deviations too small to affect any practical analysis.

Best practice: Use Q-Q plots for visual assessment. Use Shapiro-Wilk as a secondary check, and interpret the p-value alongside effect size and visual inspection, not as the sole decision criterion.

Code
# Shapiro-Wilk normality test  
shapiro.test(normal_data)  # should not reject normality  

    Shapiro-Wilk normality test

data:  normal_data
W = 0.9966720515, p-value = 0.946418142
Code
shapiro.test(skewed_data)  # should reject normality  

    Shapiro-Wilk normality test

data:  skewed_data
W = 0.8583236518, p-value = 0.00000000000111047774

Skewness

Skewness measures the asymmetry of a distribution. A perfectly symmetric distribution has skewness = 0.

\[\text{skewness} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3\]

Skewness value Interpretation
= 0 Perfectly symmetric
Slightly positive (0 to 0.5) Approximately symmetric
Moderately positive (0.5 to 1) Moderate right skew
Strongly positive (> 1) Strong right skew
Negative values Mirror-image interpretations for left skew

In R, skewness is calculated using e1071::skewness() or psych::describe():

Code
library(e1071)  
  
# right-skewed corpus frequency data  
set.seed(42)  
corpus_freq <- rgamma(300, shape = 1.5, rate = 0.05)  
  
e1071::skewness(corpus_freq)  
[1] 1.31934302261
Code
# psych::describe() gives skewness alongside other stats  
psych::describe(corpus_freq)  
   vars   n  mean    sd median trimmed   mad  min   max  range skew kurtosis
X1    1 300 28.48 23.83  21.05   25.09 20.09 0.23 123.3 123.07 1.32      1.6
     se
X1 1.38
Skewness in Linguistics

Many linguistic variables are right-skewed (positive skewness):

  • Word frequencies: A few words (function words) are very frequent; most words are rare
  • Sentence lengths: Most sentences are moderate length; some are very long
  • Reaction times: Most responses are fast; occasional very slow responses stretch the upper tail
  • Pause durations: Most pauses are short; some are very long

This is why the median and non-parametric statistics are often preferred in corpus linguistics.


Kurtosis

Kurtosis measures the “peakedness” or “tail heaviness” of a distribution relative to the normal distribution. Excess kurtosis is reported relative to the normal distribution (which has kurtosis = 3; excess kurtosis = 0).

\[\text{excess kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - 3\]

Type Excess kurtosis Shape
Mesokurtic (normal) ≈ 0 Standard bell curve
Leptokurtic > 0 (positive) Tall narrow peak, heavy tails — extreme values are more common
Platykurtic < 0 (negative) Flat, short peak, thin tails — values spread more evenly

In R:

Code
# kurtosis (excess kurtosis; 0 = normal)  
e1071::kurtosis(corpus_freq)  
[1] 1.60446769286

Exercises: Distributions

Q1. A histogram of word frequencies in a corpus shows a very long right tail, with most words occurring fewer than 10 times but a few function words occurring thousands of times. What does this indicate?






Q2. In a Q-Q plot, the data points follow a straight diagonal line closely in the middle but curve sharply upward at the right end. What does this indicate?






Q3. Which function in R calculates the skewness of a numeric vector?






Comprehensive Data Summaries

Before running any statistical analysis, it is good practice to generate a comprehensive summary of your dataset. This reveals the structure, scale, and quirks of your variables — and often uncovers data quality issues (missing values, impossible values, unexpected distributions) before they cause problems in analysis.


The summary() Function

R’s built-in summary() is the quickest way to get an overview of a dataset. It automatically applies the appropriate summary to each variable type:

Code
# create a small linguistic dataset  
set.seed(42)  
linguistics_data <- data.frame(  
  speaker_id  = paste0("S", 1:50),  
  age         = round(rnorm(50, mean = 35, sd = 12)),  
  gender      = sample(c("Female", "Male", "Non-binary"), 50,  
                       replace = TRUE, prob = c(0.48, 0.48, 0.04)),  
  proficiency = ordered(sample(c("Beginner", "Intermediate", "Advanced"), 50,  
                               replace = TRUE), levels = c("Beginner","Intermediate","Advanced")),  
  RT_ms       = round(rnorm(50, 450, 90)),  
  errors      = rpois(50, lambda = 3)  
)  
  
summary(linguistics_data)  
  speaker_id             age           gender                proficiency
 Length:50          Min.   : 3.00   Length:50          Beginner    :13  
 Class :character   1st Qu.:27.25   Class :character   Intermediate:24  
 Mode  :character   Median :34.00   Mode  :character   Advanced    :13  
                    Mean   :34.56                                       
                    3rd Qu.:43.00                                       
                    Max.   :62.00                                       
     RT_ms            errors    
 Min.   :268.00   Min.   :0.00  
 1st Qu.:382.25   1st Qu.:2.00  
 Median :414.50   Median :3.00  
 Mean   :427.04   Mean   :3.14  
 3rd Qu.:460.75   3rd Qu.:4.00  
 Max.   :693.00   Max.   :8.00  

summary() gives:
- Numeric variables: min, Q1, median, mean, Q3, max (and count of NAs)
- Character variables: length and type
- Factor/ordered variables: counts per level


psych::describe() for Detailed Summaries

The describe() function from the psych package extends summary() by adding standard deviation, standard error, skewness, kurtosis, and range — everything you need for a complete descriptive statistics table:

Code
# detailed summary of numeric variables  
psych::describe(linguistics_data[, c("age", "RT_ms", "errors")])  
       vars  n   mean    sd median trimmed   mad min max range  skew kurtosis
age       1 50  34.56 13.77   34.0   35.02 12.60   3  62    59 -0.26    -0.33
RT_ms     2 50 427.04 79.91  414.5  422.55 66.72 268 693   425  0.69     1.10
errors    3 50   3.14  1.78    3.0    3.05  1.48   0   8     8  0.47    -0.03
          se
age     1.95
RT_ms  11.30
errors  0.25

The output columns are: n, mean, sd, median, trimmed (trimmed mean), mad (median absolute deviation), min, max, range, skew, kurtosis, se.


psych::describeBy() for Grouped Summaries

When you need summaries by group (e.g., by gender, condition, or proficiency level), describeBy() is invaluable:

Code
# summary by gender  
psych::describeBy(  
  x    = linguistics_data[, c("age", "RT_ms", "errors")],  
  group = linguistics_data$gender  
)  

 Descriptive statistics by group 
group: Female
       vars  n   mean    sd median trimmed   mad min max range  skew kurtosis
age       1 30  37.27 15.12   40.0   38.04 16.31   6  62    56 -0.38    -0.74
RT_ms     2 30 415.33 74.40  410.5  412.00 59.30 268 580   312  0.24    -0.20
errors    3 30   3.17  1.66    3.0    3.12  1.48   0   7     7  0.18    -0.26
          se
age     2.76
RT_ms  13.58
errors  0.30
------------------------------------------------------------ 
group: Male
       vars  n   mean    sd median trimmed   mad min max range  skew kurtosis
age       1 18  31.22 10.36   31.5   32.19  8.90   3  44    41 -1.06     0.89
RT_ms     2 18 452.00 87.71  448.5  445.31 57.82 318 693   375  0.89     0.92
errors    3 18   3.22  2.07    3.0    3.12  1.48   0   8     8  0.54    -0.48
          se
age     2.44
RT_ms  20.67
errors  0.49
------------------------------------------------------------ 
group: Non-binary
       vars n mean    sd median trimmed   mad min max range skew kurtosis se
age       1 2   24 14.14     24      24 14.83  14  34    20    0    -2.75 10
RT_ms     2 2  378 38.18    378     378 40.03 351 405    54    0    -2.75 27
errors    3 2    2  0.00      2       2  0.00   2   2     0  NaN      NaN  0

Creating a Publication-Ready Summary Table

For reporting, you often want a clean, formatted table. Here is a workflow using dplyr and flextable:

Code
library(dplyr)  
library(flextable)  
  
summary_table <- linguistics_data |>  
  dplyr::group_by(gender) |>  
  dplyr::summarise(  
    n        = n(),  
    Age_M    = round(mean(age), 1),  
    Age_SD   = round(sd(age), 1),  
    RT_M     = round(mean(RT_ms), 1),  
    RT_SD    = round(sd(RT_ms), 1),  
    Errors_M = round(mean(errors), 1)  
  )  
  
summary_table |>  
  flextable::flextable() |>  
  flextable::set_table_properties(width = .85, layout = "autofit") |>  
  flextable::theme_zebra() |>  
  flextable::fontsize(size = 12) |>  
  flextable::fontsize(size = 12, part = "header") |>  
  flextable::align_text_col(align = "center") |>  
  flextable::set_caption(caption = "Descriptive statistics by gender group (M = mean, SD = standard deviation).") |>  
  flextable::border_outer()  

gender      n   Age_M  Age_SD   RT_M  RT_SD  Errors_M
Female      30   37.3    15.1  415.3   74.4       3.2
Male        18   31.2    10.4  452.0   87.7       3.2
Non-binary   2   24.0    14.1  378.0   38.2       2.0

What to Include in a Descriptive Statistics Table

A standard descriptive statistics table for a linguistic study should report, for each group and variable:

  • n (sample size per group)
  • Mean (M) and Standard Deviation (SD) for continuous variables
  • Median (Mdn) if data are skewed or ordinal
  • Range or IQR if variability context is important
  • Percentages for categorical variables

Many journals also request skewness and kurtosis values to justify the use of parametric vs. non-parametric tests.


Exercises: Data Summaries

Q1. What does R’s built-in summary() function report for a numeric variable?






Q2. Which function provides skewness and kurtosis alongside mean, SD, and SE in a single output?






Frequency Tables and Cross-Tabulations

Frequency tables are essential for describing categorical (nominal and ordinal) variables. They show how many observations fall into each category — the foundation for describing language use patterns, speaker demographics, and grammatical choices.


Simple Frequency Tables

The table() function in base R creates frequency tables:

Code
# frequency table of grammatical construction choices (simulated)  
set.seed(42)  
n_tokens <- 200  
constructions <- sample(  
  c("morphological (-er)", "periphrastic (more)", "mixed"),  
  size    = n_tokens,  
  replace = TRUE,  
  prob    = c(0.55, 0.35, 0.10)  
)  
  
# absolute frequencies  
(freq_abs <- table(constructions))  
constructions
              mixed morphological (-er) periphrastic (more) 
                 26                  99                  75 
Code
# relative frequencies (proportions)  
(freq_rel <- prop.table(freq_abs))  
constructions
              mixed morphological (-er) periphrastic (more) 
              0.130               0.495               0.375 
Code
# relative frequencies as percentages  
round(freq_rel * 100, 1)  
constructions
              mixed morphological (-er) periphrastic (more) 
               13.0                49.5                37.5 

For a polished table output:

Code
data.frame(  
  Construction = names(freq_abs),  
  Count        = as.integer(freq_abs),  
  Percent      = round(as.numeric(freq_rel) * 100, 1)  
) |>  
  flextable::flextable() |>  
  flextable::set_table_properties(width = .55, layout = "autofit") |>  
  flextable::theme_zebra() |>  
  flextable::fontsize(size = 12) |>  
  flextable::fontsize(size = 12, part = "header") |>  
  flextable::align_text_col(align = "center") |>  
  flextable::set_caption(caption = "Frequency of comparative construction types (simulated corpus data).") |>  
  flextable::border_outer()  

Construction         Count  Percent
mixed                   26     13.0
morphological (-er)     99     49.5
periphrastic (more)     75     37.5


Two-Way Cross-Tabulations

A cross-tabulation (contingency table) shows the joint distribution of two categorical variables. This is the standard way to examine whether the frequency distribution of one variable differs across levels of another.

Code
# simulate comparative construction data with register variable  
set.seed(42)  
n <- 300  
register      <- sample(c("Formal", "Informal"), n, replace = TRUE, prob = c(0.5, 0.5))  
construction  <- ifelse(register == "Formal",  
                        sample(c("periphrastic", "morphological"), n, replace = TRUE, prob = c(0.70, 0.30)),  
                        sample(c("periphrastic", "morphological"), n, replace = TRUE, prob = c(0.35, 0.65)))  
  
(crosstab <- table(Register = register, Construction = construction))  
          Construction
Register   morphological periphrastic
  Formal              43          106
  Informal           109           42
Code
# row proportions: what proportion of each register uses each construction?  
round(prop.table(crosstab, margin = 1) * 100, 1)  
          Construction
Register   morphological periphrastic
  Formal            28.9         71.1
  Informal          72.2         27.8
Code
# visualise the cross-tabulation as a grouped bar chart  
data.frame(  
  Register     = rep(rownames(crosstab), ncol(crosstab)),  
  Construction = rep(colnames(crosstab), each = nrow(crosstab)),  
  Count        = as.vector(crosstab)  
) |>  
  ggplot(aes(x = Register, y = Count, fill = Construction)) +  
  geom_bar(stat = "identity", position = "dodge", alpha = 0.85) +  
  theme_bw() +  
  scale_fill_manual(values = c("steelblue", "tomato")) +  
  labs(title = "Comparative construction choice by register (simulated)",  
       y = "Count", x = "Register") +  
  theme(legend.position = "top", panel.grid.minor = element_blank())  

From Description to Inference

A cross-tabulation describes the data. To test whether the association between two categorical variables is statistically significant, we use a chi-square test of independence (chisq.test() in R). But that is inferential statistics — cross-tabulations are the descriptive foundation.
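As a preview of that inferential step, applied to the crosstab object created above:

Code
# chi-square test of independence (inferential, shown here only as a preview)
chisq.test(crosstab)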


Exercises: Frequency Tables

Q1. A researcher finds the following construction counts: morphological = 110, periphrastic = 90. What proportion uses the periphrastic form?






Q2. In a cross-tabulation, what does prop.table(tab, margin = 1) compute?






Data Visualisation for Description

Effective data visualisation is as important as numerical summaries. Graphs reveal patterns, outliers, and distributional shapes that tables cannot communicate. This section covers the most important plot types for descriptive purposes.


Boxplots

The boxplot (box-and-whisker plot) is the standard visualisation of the five-number summary: minimum, Q1, median, Q3, maximum. Outliers (values more than 1.5 × IQR beyond Q1 or Q3) are plotted individually.

Code
set.seed(42)  
rt_viz <- data.frame(  
  RT        = c(rnorm(80, 450, 70), rnorm(80, 520, 100)),  
  Condition = rep(c("Congruent", "Incongruent"), each = 80)  
)  
  
rt_viz |>  
  ggplot(aes(x = Condition, y = RT, fill = Condition)) +  
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.shape = 16) +  
  theme_bw() +  
  scale_fill_manual(values = c("steelblue", "tomato")) +  
  labs(title = "Reaction times by condition (simulated)",  
       y = "Reaction Time (ms)", x = "") +  
  theme(legend.position = "none", panel.grid.minor = element_blank())  


Violin Plots

The violin plot combines the boxplot’s summary statistics with the density plot’s distributional information. It is particularly useful when distributions are non-normal or multimodal.

Code
rt_viz |>  
  ggplot(aes(x = Condition, y = RT, fill = Condition)) +  
  geom_violin(alpha = 0.6, trim = FALSE) +  
  geom_boxplot(width = 0.15, fill = "white", outlier.color = "red") +  
  theme_bw() +  
  scale_fill_manual(values = c("steelblue", "tomato")) +  
  labs(title = "Violin + boxplot: reaction times by condition (simulated)",  
       y = "Reaction Time (ms)", x = "") +  
  theme(legend.position = "none", panel.grid.minor = element_blank())  

Boxplot vs. Violin Plot

Use a boxplot when you want to emphasise the five-number summary and outliers, and when your audience is familiar with boxplots.

Use a violin plot (especially with embedded boxplot) when:
- The distribution shape matters (e.g., bimodal vs. unimodal)
- You want to show density alongside summary statistics
- You are comparing multiple groups and want to see distributional differences beyond spread


Bar Charts for Categorical Data

Bar charts are the standard visualisation for categorical (frequency) data. Use geom_bar() for raw data or geom_col() when you have pre-computed counts:

Code
# bar chart with error bars (mean ± SE)  
set.seed(42)  
bar_data <- data.frame(  
  Group     = rep(c("Group A", "Group B", "Group C"), each = 30),  
  Score     = c(rnorm(30, 70, 10), rnorm(30, 75, 12), rnorm(30, 65, 8))  
)  
  
bar_summary <- bar_data |>  
  dplyr::group_by(Group) |>  
  dplyr::summarise(  
    M  = mean(Score),  
    SE = sd(Score) / sqrt(n()),  
    .groups = "drop"  
  )  
  
bar_summary |>  
  ggplot(aes(x = Group, y = M, fill = Group)) +  
  geom_col(alpha = 0.8, width = 0.6) +  
  geom_errorbar(aes(ymin = M - SE, ymax = M + SE), width = 0.2, linewidth = 0.8) +  
  coord_cartesian(ylim = c(55, 90)) +  
  theme_bw() +  
  scale_fill_manual(values = c("steelblue", "tomato", "seagreen")) +  
  labs(title = "Mean scores by group with standard error bars",  
       y = "Mean Score", x = "") +  
  theme(legend.position = "none", panel.grid.minor = element_blank())  


Scatterplots

A scatterplot visualises the relationship between two continuous variables. It is the standard first step in any correlation analysis.

Code
set.seed(42)  
scatter_data <- data.frame(  
  Syllables  = sample(1:4, 150, replace = TRUE),  
  Periphrastic_pct = NA  
)  
scatter_data$Periphrastic_pct <- 30 - scatter_data$Syllables * 5 +   
  rnorm(150, 0, 8) + scatter_data$Syllables * 12 +   
  rnorm(150, 0, 5)  
scatter_data$Periphrastic_pct <- pmax(0, pmin(100, scatter_data$Periphrastic_pct))  
  
scatter_data |>  
  ggplot(aes(x = Syllables, y = Periphrastic_pct)) +  
  geom_jitter(width = 0.15, alpha = 0.5, color = "steelblue") +  
  geom_smooth(method = "lm", se = TRUE, color = "tomato", fill = "tomato", alpha = 0.2) +  
  theme_bw() +  
  labs(title = "Adjective syllable count vs. periphrastic comparative use (simulated)",  
       x = "Number of syllables", y = "% periphrastic comparative") +  
  theme(panel.grid.minor = element_blank())  

The regression line and confidence band summarise the trend: longer adjectives tend to use the periphrastic comparative more frequently — a well-established finding in the comparative literature.


Correlation

Correlation is one of the most commonly reported descriptive statistics for quantifying the strength and direction of a linear relationship between two continuous variables. It is important to note that correlation is a descriptive statistic — it quantifies an observed association, not a causal relationship.

Correlation ≠ Causation

A correlation between X and Y tells us they co-vary systematically. It does not tell us:

  • Whether X causes Y
  • Whether Y causes X
  • Whether a third variable Z causes both

Causal inference requires experimental design, not correlation alone.


Pearson’s r

Pearson’s product-moment correlation coefficient (r) measures the strength of a linear relationship between two normally distributed continuous variables.

\[r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}\]

Pearson’s r ranges from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).

r value    Strength     Typical context
|r| = 1.0  Perfect      Measurement error only
|r| ≥ 0.9  Very strong  Measurement reliability studies
|r| ≥ 0.7  Strong       Well-established predictors in psychology/linguistics
|r| ≥ 0.5  Moderate     Typical effect size in social sciences
|r| ≥ 0.3  Weak         Small but potentially meaningful effect
|r| < 0.3  Negligible   Often noise — interpret with caution

In R:

Code
set.seed(42)  
syllables <- sample(1:4, 100, replace = TRUE)  
periphrastic <- 20 + syllables * 15 + rnorm(100, 0, 10)  
  
# Pearson's r  
cor(syllables, periphrastic, method = "pearson")  
[1] 0.881733492037
Code
# with significance test  
cor.test(syllables, periphrastic, method = "pearson")  

    Pearson's product-moment correlation

data:  syllables and periphrastic
t = 18.50292682, df = 98, p-value < 0.0000000000000002220446
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.828865247590 0.918992703621
sample estimates:
           cor 
0.881733492037 

Spearman’s ρ (rho)

Spearman’s rank correlation coefficient (ρ) is the non-parametric alternative to Pearson’s r. It measures the strength of a monotonic (not necessarily linear) relationship and is appropriate when:

  • The data are ordinal (e.g., Likert scale ratings, ranked judgements)
  • The data are continuous but non-normal (e.g., skewed frequency data)
  • Outliers are present that would distort Pearson’s r

Spearman’s ρ is calculated by ranking both variables and then computing Pearson’s r on the ranks.
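This equivalence is easy to verify in R with the syllables and periphrastic vectors created above:

Code
cor(rank(syllables), rank(periphrastic))           # Pearson's r on the ranks
cor(syllables, periphrastic, method = "spearman")  # identical value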

Code
# Spearman correlation (more robust to skew and outliers)  
cor.test(syllables, periphrastic, method = "spearman")  

    Spearman's rank correlation rho

data:  syllables and periphrastic
S = 16367.30305, p-value < 0.0000000000000002220446
alternative hypothesis: true rho is not equal to 0
sample estimates:
           rho 
0.901786360316 

Correlation Matrices

When you have multiple variables and want to explore their pairwise relationships, a correlation matrix is efficient:

Code
set.seed(42)  
corr_data <- data.frame(  
  syllables       = sample(1:4, 80, replace = TRUE),  
  word_freq       = round(rgamma(80, 2, 0.01)),  
  formality_score = rnorm(80, 50, 15),  
  RT_ms           = rnorm(80, 480, 80)  
)  
  
# correlation matrix (Pearson)  
round(cor(corr_data, method = "pearson"), 2)  
                syllables word_freq formality_score RT_ms
syllables            1.00     -0.05            0.06  0.09
word_freq           -0.05      1.00            0.11  0.05
formality_score      0.06      0.11            1.00 -0.25
RT_ms                0.09      0.05           -0.25  1.00

For a visually appealing correlation matrix, use the ggcorrplot package or ggpubr::ggscatter() for pairwise scatterplots:

Code
# pairwise scatterplot matrix with correlation coefficients  
pairs(corr_data,  
      main  = "Pairwise scatterplot matrix",  
      pch   = 16,  
      col   = adjustcolor("steelblue", alpha.f = 0.6),  
      lower.panel = function(x, y, ...) {  
        r <- round(cor(x, y), 2)  
        par(usr = c(0, 1, 0, 1))  
        text(0.5, 0.5, paste0("r = ", r), cex = 1.2,  
             col = ifelse(abs(r) > 0.3, "red", "gray40"))  
      })  


Exercises: Correlation

Q1. A researcher calculates Pearson’s r = −0.72 between adjective syllable count and the rate of morphological comparative use. What does this mean?






Q2. When should Spearman’s ρ be preferred over Pearson’s r?






Outlier Detection

Outliers are observations that deviate markedly from the rest of the data. They can arise from data entry errors, measurement problems, genuine extreme cases, or non-normal distributions. Detecting and understanding outliers is a critical part of descriptive analysis — they can distort means, inflate variances, and violate the assumptions of statistical tests.

Outliers: Exclude or Investigate?

Outliers should not be automatically removed. Before excluding any observation, ask:

  1. Is it a data entry or measurement error? → Fix or exclude with justification
  2. Is it a genuine extreme value from the population of interest? → Retain and report
  3. Is it from a different population? (e.g., a non-native speaker in a native-speaker study) → Exclusion criterion should be defined a priori

Always report how many observations were identified as outliers and what decision was made. Transparency is essential.


Method 1: The Z-Score Method

A z-score expresses each value as the number of standard deviations from the mean. Values with \(|z| > 3\) are commonly flagged as potential outliers (corresponding to approximately the outer 0.3% of a normal distribution).

Code
set.seed(42)  
# simulate reading time data with one extreme outlier  
reading_times <- c(rnorm(99, mean = 450, sd = 80), 1850)  
  
# calculate z-scores  
z_scores <- scale(reading_times)[, 1]  
  
# identify outliers: |z| > 3  
outliers_z <- which(abs(z_scores) > 3)  
cat("Outliers (z-score method):", outliers_z, "\n")  
Outliers (z-score method): 100 
Code
cat("Outlier values:", reading_times[outliers_z], "\n")  
Outlier values: 1850 

Method 2: The IQR Method

The IQR method (Tukey’s fences) is more robust than the z-score approach because it does not depend on the mean or standard deviation, which are themselves distorted by outliers. An observation is flagged if it falls more than 1.5 × IQR below Q1 or above Q3.

\[\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}\]
\[\text{Upper fence} = Q_3 + 1.5 \times \text{IQR}\]

This is the same rule used by ggplot2 boxplots to plot individual outlier points.

Code
Q1  <- quantile(reading_times, 0.25)  
Q3  <- quantile(reading_times, 0.75)  
IQR_val <- IQR(reading_times)  
  
lower_fence <- Q1 - 1.5 * IQR_val  
upper_fence <- Q3 + 1.5 * IQR_val  
  
cat("Lower fence:", round(lower_fence, 1), "\n")  
Lower fence: 244.4 
Code
cat("Upper fence:", round(upper_fence, 1), "\n")  
Upper fence: 661.1 
Code
outliers_iqr <- which(reading_times < lower_fence | reading_times > upper_fence)  
cat("Outliers (IQR method):", reading_times[outliers_iqr], "\n")  
Outliers (IQR method): 237.483566328 210.552793348 1850 

Visualising Outliers

The best way to identify and communicate outliers is visually:

Code
df_rt <- data.frame(RT = reading_times,  
                    is_outlier = abs(scale(reading_times)[,1]) > 3)  
  
# boxplot: outliers shown as individual points  
p1 <- df_rt |>  
  ggplot(aes(y = RT)) +  
  geom_boxplot(fill = "steelblue", alpha = 0.6, outlier.color = "red",  
               outlier.size = 3) +  
  theme_bw() +  
  labs(title = "Boxplot", y = "Reading Time (ms)", x = "") +  
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())  
  
# dot plot: all points, outlier highlighted  
p2 <- df_rt |>  
  ggplot(aes(x = seq_along(RT), y = RT, color = is_outlier)) +  
  geom_point(alpha = 0.6) +  
  scale_color_manual(values = c("steelblue", "red"),  
                     labels = c("Normal", "Outlier")) +  
  theme_bw() +  
  labs(title = "Dot plot with outlier flagged",  
       x = "Observation index", y = "Reading Time (ms)", color = "") +  
  theme(legend.position = "top")  
  
# display side by side using cowplot  
plot_grid(p1, p2, ncol = 2, rel_widths = c(0.35, 0.65))  


Exercises: Outliers

Q1. A reaction time dataset has Q1 = 380 ms, Q3 = 520 ms, and IQR = 140 ms. Using Tukey’s fences (1.5 × IQR), what is the upper fence above which values are flagged as outliers?






Q2. Why is the IQR method for detecting outliers generally preferred over the z-score method?






Quick Reference

Choosing the Right Descriptive Statistic

Question | Statistic | R function
What is the typical value? | Arithmetic mean | mean(x)
What is the typical value? (skewed/ordinal) | Median | median(x)
What is the most common category? | Mode | names(which.max(table(x)))
How spread out are the values? | Range or IQR | range(x); IQR(x)
How spread out? (robust to outliers) | IQR (interquartile range) | IQR(x)
How spread out? (for reporting with mean) | Standard deviation (SD) | sd(x)
How precise is my mean estimate? | Standard error (SE) | sd(x)/sqrt(length(x))
Is the distribution symmetric? | Skewness; Q-Q plot; histogram | e1071::skewness(x); qqnorm(x)
Are there extreme values? | Z-scores or IQR fences; boxplot | scale(x); IQR method
How strong is the relationship between two variables? | Pearson's r | cor(x, y, method='pearson')
How strong? (ordinal/non-normal data) | Spearman's ρ | cor(x, y, method='spearman')

Visualisation Cheat Sheet

Data type Best plots
Single continuous variable Histogram, density plot, boxplot
Continuous + grouping variable Boxplot by group, violin plot, density overlay
Two continuous variables Scatterplot (with regression line)
Single categorical variable Bar chart, frequency table
Two categorical variables Grouped bar chart, cross-tabulation, mosaic plot
Checking normality Q-Q plot, histogram vs. normal curve
Checking for outliers Boxplot, dot plot, z-score plot

Complete Workflow Checklist

Use this checklist every time you begin a new descriptive analysis:

Step Action R tool
1 Load data readr::read_csv()
2 Inspect structure str(), dplyr::glimpse()
3 Check for missing values colSums(is.na())
4 Check factor levels for typos table(data$var)
5 Check numeric ranges for impossible values summary(), range()
6 Compute central tendency mean(), median(), mode via which.max(table())
7 Compute variability sd(), IQR(), mad()
8 Check distribution shape hist(), qqnorm(), e1071::skewness()
9 Detect outliers boxplot(), z-score or IQR method
10 Grouped summaries psych::describeBy(), dplyr::group_by() |> summarise()
11 Compute correlations cor(), cor.test()
12 Calculate effect sizes effectsize::cohens_d(), effectsize::eta_squared()
13 Build reporting table flextable::flextable()

Effect Size Benchmarks

Statistic Small Medium Large Use when
Cohen’s d 0.2 0.5 0.8 Comparing two group means
Pearson’s r 0.1 0.3 0.5 Two continuous variables
Eta-squared (η²) 0.01 0.06 0.14 ANOVA: proportion of variance

Robust Measures and Resistant Statistics

Standard measures such as the mean and standard deviation are strongly affected by extreme values and by departures from normality. Robust statistics are designed to remain informative even when the data violate these conditions. They are particularly important in linguistics, where corpus frequency data, reaction times, and judgement scores are routinely non-normal.


The Trimmed Mean

The trimmed mean discards a fixed percentage of the most extreme values at both ends of the distribution before calculating the mean. A common choice is the 5% trimmed mean (removing the bottom 5% and top 5% of values) or the 10% trimmed mean.

\[\bar{x}_{\text{trim}} = \frac{1}{n - 2k} \sum_{i=k+1}^{n-k} x_{(i)}\]

where \(x_{(i)}\) are the sorted values and \(k = \lfloor n \cdot \text{trim} \rfloor\) observations are dropped from each end.

Code
set.seed(42)  
# reaction times with a few extreme slow responses  
rt_skewed <- c(rnorm(95, mean = 450, sd = 70), c(1800, 1950, 2100, 1750, 1600))  
  
# regular mean — inflated by outliers  
mean(rt_skewed)  
[1] 523.679027955
Code
# 5% trimmed mean — more resistant  
mean(rt_skewed, trim = 0.05)  
[1] 464.198775463
Code
# 10% trimmed mean  
mean(rt_skewed, trim = 0.10)  
[1] 464.34869954
Code
# median for comparison  
median(rt_skewed)  
[1] 468.706526469

The trimmed mean sits between the mean (inflated by the outliers) and the median (which discards magnitude information for all but the middle of the ordered values). It is reported by psych::describe() in the trimmed column (10% trim by default).
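
To connect the output to the formula above, the 5% trimmed mean can also be computed by hand: sort the values and drop k observations from each end before averaging. This sketch should reproduce mean(rt_skewed, trim = 0.05):

Code
# manual 5% trimmed mean, following the formula above  
trim <- 0.05  
x_sorted <- sort(rt_skewed)  
k <- floor(length(x_sorted) * trim)  
mean(x_sorted[(k + 1):(length(x_sorted) - k)])  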


Winsorisation

Winsorisation replaces extreme values with the nearest non-extreme value rather than discarding them. For example, in a 5% Winsorised dataset, all values below the 5th percentile are replaced with the 5th percentile value, and all values above the 95th percentile are replaced with the 95th percentile value.

Unlike trimming, Winsorisation retains the full sample size — important when data are scarce.
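
Conceptually, Winsorisation just clamps values at the chosen percentiles. A minimal manual sketch (assuming the default 5th/95th percentile cut-offs that DescTools::Winsorize() also uses below):

Code
# manual Winsorisation at the 5th and 95th percentiles  
q <- quantile(rt_skewed, probs = c(0.05, 0.95))  
rt_wins_manual <- pmin(pmax(rt_skewed, q[1]), q[2])  
mean(rt_wins_manual)  # should closely match the DescTools result below  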

Code
library(DescTools)  
  
# Winsorise at 5th and 95th percentiles  
rt_winsorised <- DescTools::Winsorize(rt_skewed)  
  
# compare means  
cat("Original mean:    ", round(mean(rt_skewed), 1), "\n")  
Original mean:     523.7 
Code
cat("Winsorised mean:  ", round(mean(rt_winsorised), 1), "\n")  
Winsorised mean:   467.1 
Code
cat("Trimmed mean:     ", round(mean(rt_skewed, trim = 0.05), 1), "\n")  
Trimmed mean:      464.2 
Code
cat("Median:           ", round(median(rt_skewed), 1), "\n")  
Median:            468.7 

Median Absolute Deviation (MAD)

The Median Absolute Deviation (MAD) is a highly robust measure of variability. It is computed as the median of the absolute deviations from the median:

\[\text{MAD} = \text{median}\left(|x_i - \text{median}(x)|\right)\]

Because the MAD uses the median twice, it is highly resistant to extreme values: up to half of the observations can be arbitrarily extreme before it breaks down (a 50% breakdown point). It is reported by psych::describe() as the mad column.

Code
# MAD is resistant to extreme values  
mad(rt_skewed, constant = 1)   # raw MAD (constant=1 gives actual median of |deviations|)  
[1] 47.7335411062
Code
mad(rt_skewed)                  # scaled MAD (default: multiply by 1.4826 for consistency with SD)  
[1] 70.7697480441
Code
# for comparison: SD is heavily inflated by the outliers  
sd(rt_skewed)  
[1] 314.124563525
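
The definition can be verified by hand: the raw MAD is literally the median of the absolute deviations from the median, and the scaled version multiplies this by 1.4826:

Code
# manual check of the MAD definition  
median(abs(rt_skewed - median(rt_skewed)))            # matches mad(rt_skewed, constant = 1)  
1.4826 * median(abs(rt_skewed - median(rt_skewed)))   # matches mad(rt_skewed)  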
MAD vs. SD: When to Use Which
Measure Use when
Standard deviation (SD) Data are approximately normally distributed; parametric tests will be used
MAD Data are skewed or contain outliers; non-parametric or robust methods are used
IQR Reporting alongside median in summary tables; ordinal or skewed data

In published linguistics research, SD is still most commonly reported — but MAD and trimmed means are increasingly used in psycholinguistics and corpus work, where data rarely meet normality assumptions.


Exercises: Robust Measures

Q1. A reaction time dataset has a mean of 520 ms and a 10% trimmed mean of 470 ms. What does this discrepancy suggest?






Q2. Why is the MAD considered more robust than the standard deviation?






Working with Real Data: Loading, Inspecting, and Cleaning

All previous sections have used simulated data. In practice, you will be loading messy, real-world datasets from files. This section covers the practical workflow for importing data, getting an initial overview, and handling missing values — essential skills before any descriptive analysis.


Loading Data

The most common data formats in linguistics research are CSV files (comma-separated values) and Excel spreadsheets. The recommended approach is the readr package (part of tidyverse) for CSVs and readxl for Excel files.

Code
library(readr)  
library(readxl)  
  
# load a CSV file  
data_csv <- readr::read_csv("path/to/your/data.csv")  
  
# load an Excel file (specify sheet name or number)  
data_xlsx <- readxl::read_excel("path/to/your/data.xlsx", sheet = 1)  
  
# base R alternative for CSVs  
data_base <- read.csv("path/to/your/data.csv", stringsAsFactors = FALSE)  
read_csv() vs. read.csv()

Prefer readr::read_csv() over base R’s read.csv() because it:

  • Does not convert strings to factors by default (no need for stringsAsFactors = FALSE)
  • Reads files faster for large datasets
  • Reports a column specification summary so you can verify that columns were parsed correctly
  • Creates a tibble (tidyverse data frame) with cleaner printing behaviour

Getting an Initial Overview

Before computing any statistics, always inspect your data structure:

Code
# create a realistic linguistics dataset to demonstrate  
set.seed(42)  
n <- 120  
linguistics_corpus <- data.frame(  
  text_id      = paste0("T", sprintf("%03d", 1:n)),  
  register     = sample(c("Spoken", "Written", "CMC"), n, replace = TRUE),  
  word_count   = round(rnorm(n, 850, 300)),  
  TTR          = round(runif(n, 0.3, 0.8), 3),    # type-token ratio  
  mean_wl      = round(rnorm(n, 4.8, 0.9), 2),    # mean word length in chars  
  clauses_per_sent = round(rnorm(n, 2.1, 0.6), 1),  
  speaker_age  = c(round(rnorm(n - 5, 35, 12)), rep(NA, 5))  # 5 missing values  
)  
  
# 1. Check dimensions (rows × columns)  
dim(linguistics_corpus)  
[1] 120   7
Code
# 2. Variable names and types  
str(linguistics_corpus)  
'data.frame':   120 obs. of  7 variables:
 $ text_id         : chr  "T001" "T002" "T003" "T004" ...
 $ register        : chr  "Spoken" "Spoken" "Spoken" "Spoken" ...
 $ word_count      : num  990 888 594 575 707 582 783 835 521 920 ...
 $ TTR             : num  0.338 0.311 0.557 0.615 0.509 0.74 0.354 0.79 0.432 0.342 ...
 $ mean_wl         : num  4.98 6.15 3.77 5.25 6.07 4.24 4.13 5.14 3.62 4.87 ...
 $ clauses_per_sent: num  3.1 2.4 0.7 2.4 2 2.2 1.3 1.6 0.7 2.9 ...
 $ speaker_age     : num  6 17 36 11 42 46 37 25 47 28 ...
Code
# 3. First few rows  
head(linguistics_corpus, 5)  
  text_id register word_count   TTR mean_wl clauses_per_sent speaker_age
1    T001   Spoken        990 0.338    4.98              3.1           6
2    T002   Spoken        888 0.311    6.15              2.4          17
3    T003   Spoken        594 0.557    3.77              0.7          36
4    T004   Spoken        575 0.615    5.25              2.4          11
5    T005  Written        707 0.509    6.07              2.0          42
Code
# 4. tidyverse-style overview (cleaner output)  
dplyr::glimpse(linguistics_corpus)  
Rows: 120
Columns: 7
$ text_id          <chr> "T001", "T002", "T003", "T004", "T005", "T006", "T007…
$ register         <chr> "Spoken", "Spoken", "Spoken", "Spoken", "Written", "W…
$ word_count       <dbl> 990, 888, 594, 575, 707, 582, 783, 835, 521, 920, 107…
$ TTR              <dbl> 0.338, 0.311, 0.557, 0.615, 0.509, 0.740, 0.354, 0.79…
$ mean_wl          <dbl> 4.98, 6.15, 3.77, 5.25, 6.07, 4.24, 4.13, 5.14, 3.62,…
$ clauses_per_sent <dbl> 3.1, 2.4, 0.7, 2.4, 2.0, 2.2, 1.3, 1.6, 0.7, 2.9, 3.1…
$ speaker_age      <dbl> 6, 17, 36, 11, 42, 46, 37, 25, 47, 28, 29, 14, 29, 31…
Code
# 5. Full summary  
summary(linguistics_corpus)  
   text_id            register           word_count            TTR             
 Length:120         Length:120         Min.   :   3.000   Min.   :0.301000000  
 Class :character   Class :character   1st Qu.: 584.250   1st Qu.:0.388500000  
 Mode  :character   Mode  :character   Median : 806.500   Median :0.508500000  
                                       Mean   : 829.025   Mean   :0.523408333  
                                       3rd Qu.:1074.750   3rd Qu.:0.638250000  
                                       Max.   :1557.000   Max.   :0.790000000  
                                                                               
    mean_wl           clauses_per_sent      speaker_age        
 Min.   :1.79000000   Min.   :0.50000000   Min.   : 5.0000000  
 1st Qu.:4.26750000   1st Qu.:1.50000000   1st Qu.:24.0000000  
 Median :4.82500000   Median :2.00000000   Median :34.0000000  
 Mean   :4.86241667   Mean   :2.00666667   Mean   :33.2869565  
 3rd Qu.:5.38500000   3rd Qu.:2.50000000   3rd Qu.:42.0000000  
 Max.   :7.19000000   Max.   :3.40000000   Max.   :66.0000000  
                                           NA's   :5           

The summary() output immediately reveals the 5 missing values (NA's: 5) in speaker_age — something you must address before computing means or running tests on that variable.


Handling Missing Values

Missing data (NA in R) are the rule rather than the exception in real linguistic datasets. R’s default behaviour is to propagate NA values: any arithmetic involving NA returns NA.

Code
# NA propagates through arithmetic  
mean(linguistics_corpus$speaker_age)           # returns NA  
[1] NA
Code
# solution: exclude NAs with na.rm = TRUE  
mean(linguistics_corpus$speaker_age, na.rm = TRUE)  
[1] 33.2869565217
Code
sd(linguistics_corpus$speaker_age,   na.rm = TRUE)  
[1] 13.1860939626
Code
median(linguistics_corpus$speaker_age, na.rm = TRUE)  
[1] 34

Identifying and Counting Missing Values

Code
# how many NAs per column?  
colSums(is.na(linguistics_corpus))  
         text_id         register       word_count              TTR 
               0                0                0                0 
         mean_wl clauses_per_sent      speaker_age 
               0                0                5 
Code
# which rows have NAs?  
which(is.na(linguistics_corpus$speaker_age))  
[1] 116 117 118 119 120
Code
# proportion of missing values per column  
round(colMeans(is.na(linguistics_corpus)) * 100, 1)  
         text_id         register       word_count              TTR 
             0.0              0.0              0.0              0.0 
         mean_wl clauses_per_sent      speaker_age 
             0.0              0.0              4.2 

Complete Cases

complete.cases() returns a logical vector: TRUE for rows with no missing values, FALSE for rows with at least one NA.

Code
# how many complete cases?  
sum(complete.cases(linguistics_corpus))  
[1] 115
Code
# filter to complete cases only  
corpus_complete <- linguistics_corpus[complete.cases(linguistics_corpus), ]  
nrow(corpus_complete)  
[1] 115
Missing Data: Do Not Simply Delete

Listwise deletion (removing any row with an NA) is the default in most R functions, but it is not always the best approach:

  • If data are Missing Completely at Random (MCAR): deletion is unbiased but reduces sample size
  • If data are Missing at Random (MAR) or Missing Not at Random (MNAR): deletion introduces bias

Advanced techniques — multiple imputation (mice package), maximum likelihood estimation, or mixed models — handle missing data more appropriately. Always report the number of missing values and your handling strategy in your methods section.
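
As a rough illustration of what multiple imputation looks like in practice, the sketch below uses the mice package on the numeric columns of the corpus data; the settings (m = 5 imputations, predictive mean matching) are illustrative assumptions, not recommendations for any particular dataset:

Code
# minimal multiple-imputation sketch with mice (illustrative settings)  
# install.packages("mice")  
library(mice)  
imp <- mice(linguistics_corpus[, c("word_count", "TTR", "mean_wl",  
                                   "clauses_per_sent", "speaker_age")],  
            m = 5, method = "pmm", seed = 42, printFlag = FALSE)  
# average the mean speaker_age across the five imputed datasets  
mean(sapply(1:5, function(i) mean(mice::complete(imp, i)$speaker_age)))  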


Checking Data Quality

Before analysing, always check for impossible or suspicious values:

Code
# check for impossible values (e.g., negative word counts or TTR > 1)  
sum(linguistics_corpus$word_count < 0)     # should be 0  
[1] 0
Code
sum(linguistics_corpus$TTR > 1)           # should be 0  
[1] 0
Code
# check range of each numeric variable  
sapply(linguistics_corpus[, sapply(linguistics_corpus, is.numeric)], range, na.rm = TRUE)  
     word_count   TTR mean_wl clauses_per_sent speaker_age
[1,]          3 0.301    1.79              0.5           5
[2,]       1557 0.790    7.19              3.4          66
Code
# check factor/character levels for typos  
table(linguistics_corpus$register)  

    CMC  Spoken Written 
     27      49      44 

This last step catches a very common problem: data entry inconsistencies such as "spoken" and "Spoken" being treated as different categories. Always inspect categorical variable levels before analysis.
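
A typical fix is to harmonise capitalisation (or recode variants explicitly) before tabulating. The snippet below uses a small hypothetical vector, register_raw, to illustrate:

Code
# hypothetical example: the same category entered with inconsistent capitalisation  
register_raw <- c("Formal", "formal", "FORMAL", "Informal", "informal")  
register_clean <- stringr::str_to_title(register_raw)  
table(register_clean)  
# after cleaning there is one level per category: Formal (3), Informal (2)  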


Exercises: Real Data Workflow

Q1. You load a dataset and run mean(data$age) but get NA as the result. What is the most likely cause and solution?






Q2. Which function returns TRUE for rows in a data frame that have NO missing values in any column?






Q3. After loading a dataset, you find the register column has levels: ‘Formal’, ‘formal’, ‘FORMAL’, ‘Informal’, ‘informal’. What is the problem and how should you fix it?






Effect Sizes as Descriptive Statistics

Effect size statistics describe the magnitude of a difference or association in a way that is independent of sample size. While p-values only tell us whether an effect is statistically distinguishable from zero, effect sizes tell us how large the effect is — information that is essential for interpreting whether a finding is practically meaningful.

Why Effect Sizes Matter

A study with N = 10,000 participants can detect effects so tiny they are meaningless in practice. A study with N = 20 may miss a large, genuine effect. The p-value conflates effect size with sample size. Effect sizes do not.

Reporting effect sizes is now required by most major journals in linguistics, psychology, and the language sciences, following guidelines from the APA Publication Manual (7th edition) and initiatives such as the Open Science Framework.


Cohen’s d for Comparing Two Means

Cohen’s d measures the standardised difference between two group means — the difference in means expressed in units of the pooled standard deviation:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}\]

\[s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\]

|d| value | Magnitude | Practical interpretation
< 0.2 | Negligible | Likely noise
0.2 | Small | Detectable but small
0.5 | Medium | Noticeable in practice
0.8 | Large | Clearly meaningful difference
> 1.2 | Very large | Dramatic difference

In R, Cohen’s d can be calculated manually or via effectsize::cohens_d():

Code
set.seed(42)  
# simulated reaction times: congruent vs. incongruent condition  
rt_cong   <- rnorm(50, mean = 440, sd = 75)  
rt_incong <- rnorm(50, mean = 510, sd = 85)  
  
# manual calculation  
pool_sd <- sqrt(((50-1)*var(rt_cong) + (50-1)*var(rt_incong)) / (50+50-2))  
d_manual <- (mean(rt_incong) - mean(rt_cong)) / pool_sd  
cat("Cohen's d (manual):", round(d_manual, 3), "\n")  
Cohen's d (manual): 0.984 
Code
# using effectsize package  
if (!requireNamespace("effectsize", quietly = TRUE)) {  
  install.packages("effectsize")  
}  
library(effectsize)  
effectsize::cohens_d(rt_incong, rt_cong)  
Cohen's d |       95% CI
------------------------
0.98      | [0.57, 1.40]

- Estimated using pooled SD.

Pearson’s r as an Effect Size

When the relationship between two variables is the focus (rather than a group difference), Pearson’s r itself serves as an effect size. The benchmarks are:

r value Magnitude
< 0.10 Negligible
0.10 Small
0.30 Medium
0.50 Large

Pearson’s r and Cohen’s d are mathematically related:

\[r = \frac{d}{\sqrt{d^2 + 4}}\]
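
This relationship makes it easy to translate between the two metrics. Using the d value computed manually above (and assuming equal group sizes, as in this example):

Code
# convert Cohen's d into an r-type effect size (equal group sizes assumed)  
r_from_d <- d_manual / sqrt(d_manual^2 + 4)  
round(r_from_d, 2)   # roughly 0.44 for d of about 0.98  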


Eta-squared (η²) for ANOVAs

Eta-squared (η²) is the effect size for one-way ANOVA. It represents the proportion of total variance accounted for by the grouping variable:

\[\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}\]

η² value Magnitude
0.01 Small
0.06 Medium
0.14 Large
Code
set.seed(42)  
# three register groups with different mean sentence lengths  
sent_length <- c(rnorm(40, 18, 5), rnorm(40, 25, 6), rnorm(40, 12, 4))  
register_grp <- rep(c("Spoken", "Written", "CMC"), each = 40)  
  
# one-way ANOVA  
aov_result <- aov(sent_length ~ register_grp)  
summary(aov_result)  
              Df     Sum Sq     Mean Sq F value                 Pr(>F)    
register_grp   2 3558.08855 1779.044276 64.6461 < 0.000000000000000222 ***
Residuals    117 3219.81034   27.519746                                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# eta-squared via effectsize package  
effectsize::eta_squared(aov_result)  
# Effect Size for ANOVA

Parameter    | Eta2 |       95% CI
----------------------------------
register_grp | 0.52 | [0.42, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
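
The η² value can be verified directly from the sums of squares in the ANOVA table above:

Code
# manual eta-squared: SS_between / SS_total  
ss <- summary(aov_result)[[1]][["Sum Sq"]]  
round(ss[1] / sum(ss), 2)   # ~0.52, matching effectsize::eta_squared()  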

Exercises: Effect Sizes

Q1. A study reports a statistically significant difference in reaction times between two conditions (p = .03) with Cohen’s d = 0.08. How should you interpret this?






Q2. What is the key advantage of reporting effect sizes alongside p-values?






Reporting Descriptive Statistics

Knowing how to compute descriptive statistics is only half the skill. Knowing how to report them clearly and consistently is equally important. This section summarises the conventions used in linguistics and adjacent fields.


General Principles

APA-Style Reporting Guidelines

The APA Publication Manual (7th edition) provides the most widely used conventions in linguistics and psychology:

  • Report mean and SD for continuous normally distributed variables: M = 450 ms, SD = 82 ms
  • Report median and IQR for skewed or ordinal data: Mdn = 430 ms, IQR = 95 ms
  • Round to two decimal places for most statistics; use fewer when precision is not meaningful
  • Report sample size (n) for every group
  • Include effect sizes alongside inferential statistics: t(58) = 3.12, p = .003, d = 0.81
  • For proportions: n = 42 out of 120 (35.0%)
  • Use italics for statistical symbols: M, SD, p, t, r, n
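
A small helper function can keep in-text reporting consistent with these conventions. The sketch below is a hypothetical convenience function (the name and formatting choices are assumptions, not part of any package):

Code
# hypothetical helper: APA-style "M = ..., SD = ..." string for in-text reporting  
report_m_sd <- function(x, unit = "ms", digits = 0) {  
  paste0("M = ", round(mean(x, na.rm = TRUE), digits), " ", unit,  
         ", SD = ", round(sd(x, na.rm = TRUE), digits), " ", unit)  
}  
report_m_sd(rnorm(60, 450, 80))   # returns a string like "M = 450 ms, SD = 80 ms"  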

A Model Reporting Paragraph

Here is an example of a well-written descriptive statistics paragraph for a linguistics study:

Reaction times were recorded for 60 participants (30 per condition). In the congruent condition, mean RT was M = 442 ms (SD = 74 ms, Mdn = 431 ms); in the incongruent condition, M = 518 ms (SD = 88 ms, Mdn = 504 ms). RT distributions in both conditions were approximately normal (Shapiro-Wilk: congruent W = 0.97, p = .43; incongruent W = 0.96, p = .29), and no outliers were identified (all values within ±3 SD of the group mean). Four participants were excluded prior to analysis due to error rates exceeding 30%.

Notice what this paragraph includes:

  • Sample size per group
  • Mean and SD (primary) plus median (secondary, as a robustness check)
  • Normality check with test statistic and p-value
  • Outlier handling procedure
  • Exclusion criteria and number excluded

A Model Results Table

For multiple groups or variables, a table is clearer than prose:

Code
set.seed(42)  
report_data <- data.frame(  
  Condition   = rep(c("Congruent", "Incongruent", "Neutral"), each = 40),  
  RT          = c(rnorm(40, 442, 74), rnorm(40, 518, 88), rnorm(40, 478, 81)),  
  Errors      = c(rpois(40, 1.2), rpois(40, 3.1), rpois(40, 2.0))  
)  
  
report_table <- report_data |>  
  dplyr::group_by(Condition) |>  
  dplyr::summarise(  
    n       = n(),  
    RT_M    = round(mean(RT), 1),  
    RT_SD   = round(sd(RT), 1),  
    RT_Mdn  = round(median(RT), 1),  
    Err_M   = round(mean(Errors), 2),  
    Err_SD  = round(sd(Errors), 2),  
    .groups = "drop"  
  ) |>  
  dplyr::rename(  
    `N`          = n,  
    `RT: M (ms)` = RT_M,  
    `RT: SD`     = RT_SD,  
    `RT: Mdn`    = RT_Mdn,  
    `Errors: M`  = Err_M,  
    `Errors: SD` = Err_SD  
  )  
  
report_table |>  
  flextable::flextable() |>  
  flextable::set_table_properties(width = .9, layout = "autofit") |>  
  flextable::theme_zebra() |>  
  flextable::fontsize(size = 12) |>  
  flextable::fontsize(size = 12, part = "header") |>  
  flextable::align_text_col(align = "center") |>  
  flextable::set_caption(  
    caption = "Descriptive statistics by condition. M = mean; SD = standard deviation; Mdn = median.") |>  
  flextable::border_outer()  

Descriptive statistics by condition. M = mean; SD = standard deviation; Mdn = median.

Condition | N | RT: M (ms) | RT: SD | RT: Mdn | Errors: M | Errors: SD
Congruent | 40 | 439.1 | 90.5 | 434.6 | 1.00 | 0.99
Incongruent | 40 | 525.0 | 80.6 | 542.7 | 2.78 | 2.02
Neutral | 40 | 481.9 | 78.4 | 480.2 | 1.98 | 1.33

Exercises: Reporting

Q1. A researcher writes: “The mean response time was 487 ms.” What crucial information is missing?






Q2. For a 7-point Likert scale measuring attitude, which statistics should be reported?






Schweinberger, Martin. 2026. Descriptive Statistics with R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/dstats/dstats.html (Version 2026.02.11).

@manual{schweinberger2026dstats,  
  author       = {Schweinberger, Martin},  
  title        = {Descriptive Statistics with R},  
  note         = {https://ladal.edu.au/tutorials/dstats/dstats.html},  
  year         = {2026},  
  organization = {The University of Queensland, Australia. School of Languages and Cultures},  
  address      = {Brisbane},  
  edition      = {2026.02.11}  
}  
Code
sessionInfo()  
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] effectsize_1.0.0  e1071_1.7-16      cowplot_1.2.0     checkdown_0.0.13 
 [5] Rmisc_1.5.1       plyr_1.8.9        lattice_0.22-6    psych_2.4.12     
 [9] ggpubr_0.6.0      flextable_0.9.7   ggplot2_4.0.2     stringr_1.5.1    
[13] dplyr_1.2.0       DescTools_0.99.59 boot_1.3-31      

loaded via a namespace (and not attached):
 [1] mnormt_2.1.1            gld_2.6.7               sandwich_3.1-1         
 [4] readxl_1.4.3            rlang_1.1.7             magrittr_2.0.3         
 [7] multcomp_1.4-28         compiler_4.4.2          systemfonts_1.2.1      
[10] vctrs_0.7.1             pkgconfig_2.0.3         fastmap_1.2.0          
[13] backports_1.5.0         rmarkdown_2.30          markdown_2.0           
[16] haven_2.5.4             ragg_1.3.3              purrr_1.0.4            
[19] xfun_0.56               litedown_0.9            jsonlite_1.9.0         
[22] uuid_1.2-1              broom_1.0.7             parallel_4.4.2         
[25] R6_2.6.1                stringi_1.8.4           RColorBrewer_1.1-3     
[28] car_3.1-3               cellranger_1.1.0        estimability_1.5.1     
[31] Rcpp_1.0.14             knitr_1.51              zoo_1.8-13             
[34] parameters_0.24.1       Matrix_1.7-2            splines_4.4.2          
[37] tidyselect_1.2.1        rstudioapi_0.17.1       abind_1.4-8            
[40] yaml_2.3.10             codetools_0.2-20        tibble_3.2.1           
[43] withr_3.0.2             bayestestR_0.15.2       S7_0.2.1               
[46] askpass_1.2.1           coda_0.19-4.1           evaluate_1.0.3         
[49] survival_3.7-0          proxy_0.4-27            zip_2.3.2              
[52] xml2_1.3.6              pillar_1.10.1           carData_3.0-5          
[55] renv_1.1.1              insight_1.0.2           generics_0.1.3         
[58] hms_1.1.3               commonmark_2.0.0        scales_1.4.0           
[61] rootSolve_1.8.2.4       xtable_1.8-4            class_7.3-22           
[64] glue_1.8.0              gdtools_0.4.1           emmeans_1.10.7         
[67] lmom_3.2                tools_4.4.2             data.table_1.17.0      
[70] ggsignif_0.6.4          forcats_1.0.0           Exact_3.3              
[73] mvtnorm_1.3-3           grid_4.4.2              tidyr_1.3.2            
[76] datawizard_1.0.0        nlme_3.1-166            Formula_1.2-5          
[79] cli_3.6.4               textshaping_1.0.0       officer_0.6.7          
[82] fontBitstreamVera_0.1.1 expm_1.0-0              gtable_0.3.6           
[85] rstatix_0.7.2           digest_0.6.39           fontquiver_0.2.1       
[88] TH.data_1.1-3           htmlwidgets_1.6.4       farver_2.1.2           
[91] htmltools_0.5.9         lifecycle_1.0.5         httr_1.4.7             
[94] fontLiberation_0.1.0    openssl_2.3.2           MASS_7.3-61            


